Linear regression: From scatter plots to a line for averages

DATAX121-23A (HAM) & (SEC) - Introduction to Statistical Methods

Learning Outcomes

  • The purpose of a linear regression
  • How to use R to:
    • Fit a simple linear regression model
    • Construct a regression model’s confidence intervals and hypothesis tests
    • Make predictions with a regression model
  • How to interpret a regression model’s best-fit line, confidence intervals, and hypothesis tests
  • How to check the assumptions for inference with a regression model using diagnostic plots

\(\widehat{y}_i\) as explained by \(x_i\) from one sample

CS 10.1: Dungeness crab growth

Dungeness crabs are commercially fished between December and June along the Pacific coast of North America. Previously only male crabs were fished, which affected the population’s viability. The fisheries consulted with biologists to set the regulations for fishing female crabs.

One question studied to help inform these regulations was whether a female crab’s postmolt size is a good predictor of her premolt size. This is because the size of a crab’s carapace is often used as a proxy for age.

Variables

  • Premolt: a number denoting the size of the carapace before molting (in millimetres)
  • Postmolt: a number denoting the size of the carapace after molting (in millimetres)
crabs.df <- read.csv("datasets/dungeness-crabs.csv")
nrow(crabs.df)
[1] 361
summary(crabs.df)
    Premolt         Postmolt    
 Min.   :102.1   Min.   :118.0  
 1st Qu.:120.9   1st Qu.:136.2  
 Median :130.4   Median :144.7  
 Mean   :129.2   Mean   :143.8  
 3rd Qu.:137.5   3rd Qu.:151.4  
 Max.   :155.1   Max.   :166.8  

Scatter plot of CS 10.1

library(lattice)  # provides xyplot()
xyplot(Premolt ~ Postmolt, data = crabs.df,
       main = "Premolt vs postmolt sizes of Dungeness crabs",
       xlab = "Postmolt size (mm)", ylab = "Premolt size (mm)")

Figure: The premolt and postmolt sizes of 361 Dungeness crabs

A scatter plot helps us describe the direction (positive or negative) and the type of relationship (linear, non-linear, or “none”)

Terminology used in regression

Figure: The premolt and postmolt sizes of 361 Dungeness crabs

Response variable (Dependent variable): Premolt size, plotted on the y-axis

Explanatory variable (Independent variable): Postmolt size, plotted on the x-axis

Briefly: Sample correlation

Figure: The premolt and postmolt sizes of 361 Dungeness crabs

A measure of the strength and direction of a linear association between two numeric variables.

cor.test( ~ Premolt + Postmolt, data = crabs.df)$estimate
      cor 
0.9831577 

Describing the sample correlation, \(r\)

  • \(|r| = 0\): no linear association (“none”)
  • \(|r|\) between 0 and 1: values closer to 0 indicate a weaker association, values closer to 1 a stronger one, with 0.5 roughly in between
  • \(|r| = 1\): a perfect linear association
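To build intuition for these descriptions, we can simulate two variables whose linear association with \(x\) differs in strength. The coefficients 0.3 and 3 below are arbitrary choices for illustration, not values from the crab data:

```r
set.seed(10)  # for reproducibility
x <- rnorm(200)
weak   <- 0.3 * x + rnorm(200)  # weak linear association with x
strong <- 3.0 * x + rnorm(200)  # strong linear association with x

cor(x, weak)    # |r| well below 0.5
cor(x, strong)  # |r| close to 1
```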

A quick aside…

Fitting straight lines through (bivariate) data

The equation for a straight line

Recall the straight line equation

\[ y_i = mx_i + c \]

where:

  • \(y_i\) is the value of the response variable for the \(i\)th observation
  • \(x_i\) is the value of the explanatory variable for the \(i\)th observation
  • \(m\) is the “gradient” of the line, that is, the amount that \(y_i\) changes if we increase \(x_i\) by one
  • \(c\) is the y-intercept, that is, the value of \(y_i\) if \(x_i = 0\)
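As a quick numeric check of these definitions, we can evaluate the straight-line equation in R. The gradient and intercept values here are arbitrary, chosen only for illustration:

```r
slope     <- 2  # m: the amount y changes when x increases by one
intercept <- 5  # c: the value of y when x = 0

x <- c(0, 1, 2, 3)
y <- slope * x + intercept
y  # 5 7 9 11: starts at the intercept, rises by the slope each step
```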

Figure: The premolt and postmolt sizes of 361 Dungeness crabs

The equation for the best-fit line

\[ y_i = \beta_0 + \beta_1 \times x_i + \varepsilon_i, ~ \text{where} ~ \varepsilon_i \sim \text{Normal}(0, \sigma_\varepsilon) \]

where:

  • \(y_i\) is the value of the response variable for the \(i\)th observation
  • \(x_i\) is the value of the explanatory variable for the \(i\)th observation
  • \(\beta_0\) is the y-intercept, that is, the average value of \(y_i\) if \(x_i = 0\)
  • \(\beta_1\) is the “gradient” of the line, that is, the amount that \(y_i\) changes on average if we increase \(x_i\) by one
  • \(\varepsilon_i\) is the residual value for the \(i\)th observation
  • \(\text{Normal}(0, \sigma_\varepsilon)\) means that we expect the residuals to be Normally distributed about 0 with a common standard deviation \(\sigma_\varepsilon\)
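One way to see what this model says is to simulate data from it and check that `lm()` recovers the parameters. The values of \(\beta_0\), \(\beta_1\), and \(\sigma_\varepsilon\) below are made up for illustration; they are not the crab estimates:

```r
set.seed(1)
n     <- 100
beta0 <- 25    # made-up y-intercept
beta1 <- 0.8   # made-up gradient
sigma <- 2     # made-up residual standard deviation

x   <- runif(n, 100, 170)              # explanatory values
eps <- rnorm(n, mean = 0, sd = sigma)  # Normal(0, sigma) residuals
y   <- beta0 + beta1 * x + eps         # the model's equation

fit <- lm(y ~ x)
coef(fit)  # estimates should be close to beta0 = 25 and beta1 = 0.8
```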

Visualising CS 10.1’s best-fit line

xyplot(Premolt ~ Postmolt, data = crabs.df, type = c("p", "r"),
       main = "Premolt vs postmolt sizes of Dungeness crabs",
       xlab = "Postmolt size (mm)", ylab = "Premolt size (mm)",
       col.line = "black", lwd = 2)

Figure: The premolt and postmolt sizes of 361 Dungeness crabs

Using data to estimate the best-fit line

Is there a natural best guess for \(\beta_0\), \(\beta_1\), and \(\sigma_\varepsilon\) based on the data?

What we can do instead is “fit” the best-fit line, using an appropriate stopping criterion that tells us when the best-fit line has been achieved

The stopping criterion used for regression models involves minimising the “variability” of the residuals, that is, the sum of squares for residuals, \(SSR\)

\[ \DeclareMathOperator*{\argminA}{arg\,min} \begin{aligned} \argminA_{\beta_0,\,\beta_1} SSR, ~ \text{where} ~ SSR &= \sum^n_{i=1}(\varepsilon_i)^2 = \sum^n_{i=1}\{y_i - (\beta_0 + \beta_1\times x_i)\}^2 \end{aligned} \]
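We can verify numerically that `lm()` minimises the \(SSR\): the fitted coefficients give a smaller sum of squared residuals than a nearby perturbed line. The data below are simulated, with a made-up “true” intercept of 3 and gradient of 1.5, purely for illustration:

```r
set.seed(42)
x <- runif(50, 0, 10)
y <- 3 + 1.5 * x + rnorm(50)  # made-up "true" line plus Normal noise

fit <- lm(y ~ x)
ssr <- function(b0, b1) sum((y - (b0 + b1 * x))^2)

ssr_lm <- ssr(coef(fit)[1], coef(fit)[2])
ssr_lm                                  # matches sum(resid(fit)^2)
ssr(coef(fit)[1] + 0.1, coef(fit)[2])   # nudging the intercept increases SSR
ssr(coef(fit)[1], coef(fit)[2] + 0.05)  # so does nudging the gradient
```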

The equation for the best-fit line, in terms of \(\widehat{y}_i\)

\[ \widehat{y}_i = \beta_0 + \beta_1 \times x_i \]

where:

  • \(\widehat{y}_i\) is the average value of the response variable for the \(i\)th observation
  • \(x_i\) is the value of the explanatory variable for the \(i\)th observation
  • \(\beta_0\) is the y-intercept, that is, the value of \(\widehat{y}_i\) if \(x_i = 0\)
  • \(\beta_1\) is the “gradient” of the line, that is, the amount that \(\widehat{y}_i\) changes if we increase \(x_i\) by one
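In R, `fitted()` returns \(\widehat{y}_i\) for every observation; it agrees with computing \(\beta_0 + \beta_1 \times x_i\) by hand from the estimated coefficients. The data below are simulated for illustration:

```r
set.seed(7)
x <- runif(30, 0, 10)
y <- 2 + 0.5 * x + rnorm(30, sd = 0.5)  # made-up line plus noise

fit  <- lm(y ~ x)
yhat <- coef(fit)[1] + coef(fit)[2] * x  # beta0 + beta1 * x_i by hand

all.equal(unname(fitted(fit)), unname(yhat))  # TRUE: same fitted values
```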

A visual explanation of Slide 13

Figure: The premolt and postmolt sizes of 361 Dungeness crabs

Assumptions for inference with a regression model

  1. Independent observations
  2. The observations’ residuals are centred at zero and have a similar measure of spread
  3. The observations’ residuals are approximately Normally distributed

More on 2.

For the best-fit line, that is, a simple linear regression model, this implies that there is a linear association between the response and explanatory variables

So for the best-fit line only, the L.I.N.E. acronym is a convenient way of recalling what the assumptions are
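The standard way to check assumptions 2 and 3 is with diagnostic plots of the residuals. Here is a sketch on simulated data; the model below is made up for illustration and, by construction, satisfies the assumptions:

```r
set.seed(3)
x <- runif(80, 0, 10)
y <- 1 + 2 * x + rnorm(80)  # made-up line with Normal residuals
fit <- lm(y ~ x)

# Residuals vs fitted values: check centring at zero and
# similar spread across the range (assumption 2)
plot(fitted(fit), resid(fit),
     xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)

# Normal Q-Q plot: check approximate Normality of the residuals (assumption 3)
qqnorm(resid(fit))
qqline(resid(fit))
```

Points scattered evenly about the zero line in the first plot, and points close to the reference line in the Q-Q plot, are consistent with the assumptions.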